Socializing Data


Diane Coyle 


Will the proliferation of data enable AI to deliver progress? An ever-growing swath
of life is available as digitally captured and stored data records. Effective govern- 
ment, business management, and even personal life are increasingly suggested to 
be a matter of using AI to interpret and act on the data. This optimism should be 
tempered with caution. Data cannot capture much of the richness of life, and while 
AI has great potential for beneficial uses, its delivery of progress in any human sense
will depend on not using all the data that can be collected. Moreover, the more dig- 
ital technology rewires society, creating opportunities for the use of big data and AI, 
the greater the need for trust and human deliberation. 


Data have always been important for government and policy. Statistics are,
as the name suggests, categorized data useful for states.1 States have collected
and collated data for centuries, not least for the purposes of taxation.
Censuses too are ancient, defining the boundaries of power, though they are
likely to be replaced by other government-collected data sets about individuals.2
The purpose of governmental measurement is to create conceptual order, to clas- 
sify the vast array of possible data points into meaningful categories, enabling bet- 
ter decisions. Over the quarter-millennium of modern economic growth, the scope 
of data collection and processing into statistics has become increasingly extensive. 
In Seeing Like a State (1998), political scientist James Scott argues that modern
states classify reality to improve the legibility of what they govern, to better control 
it. He writes: “Legibility implies a viewer whose place is central and whose vision 
is synoptic. . . . This privileged vantage point is typical of all institutional settings 
where command and control of complex human activities is paramount.”3 Many
of his examples of states bending reality into order concern economic activities 
such as forestry or agriculture, with reality conforming increasingly to the clas- 
sifications devised to understand it. There is a feedback loop whereby statistics 
collect and classify data points found in the wild, then subsequently influence ac- 
tivities and shape reality over time, so that future data will be more likely to fit into 
the predefined categories. This has been described by statistician André Vanoli 
as “the dialectic of appearance and reality.”4 Or as historian Theodore Porter put 
it, “The quantitative technologies used to investigate social and economic life al- 
ways work best if the world they aim to describe can be remade in their image.”5 
For example, the principal measure of economic progress since the early 1940s
has been gross domestic product (GDP). Governments gear their policies toward 
increasing GDP, and people duly respond to the incentives created by policies 
such as tax breaks, subsidies, public infrastructure investment, or cheaper meals 
out.6 Disappointing statistics can topple governments, as they did with the UK Labour
government of the late 1970s, paving the way for the Thatcherite revolution.
GDP has not been a terrible metric for progress: compared with previous genera- 
tions, our living standards are without doubt higher. We have better health, more 
leisure, more comfortable homes, and the convenience of many new technolo- 
gies. Yet even at the dawn of GDP’s invention, some realities had to be bent to fit 
the statistical framework. Some were rendered invisible, defined as being outside 
“the economy,” such as household work and nature. Without nature, there is no 
economy and yet the consequences for sustainability of this fateful definitional 
choice are becoming all too clear, and the progress we thought we had is at least 
partly illusory. 

Reality and the statistical picture also diverge when reality is changing. As statistician
Alain Desrosières has written, “In its innovative phase, industry rebels
against statistics because, by definition, innovation distinguishes, differentiates 
and combines resources in an unexpected way. Faced with these ‘anomalies,’ the 
statistician does not know what to do.” At present, for official statisticians, life is 
one damned anomaly after another. For just as agriculture’s share was overtaken 
by manufacturing in the industrial revolution, the material economy is smaller 
now relative to the dematerializing economy of digitally enabled services.’ The 
statistical categories no longer fit well. Paradoxically, in the economy of ever more 
data, it is proving increasingly difficult to bring informational order, for the state 
to gain that desired legibility. 


This is a paradox because the promise of big data and its use in AI has inspired
renewed visions inside government of enhanced legibility. Such visions
are not new. From the late 1950s onward, computers have seemed to
promise a clearer, synoptic understanding of society.10 One ambitious 1970s project
was Project Cybersyn in Salvador Allende’s Chile, administered by cyberneticist
Stafford Beer, which was intended to implement an efficiently planned economy.11
A similar vision of data-enabled, improved legibility has revived in the big
data digital era. On the left of UK politics it found expression as “fully automated
luxury communism.” In the UK Conservative government elected in 2019, it
took physical shape as a control room at the heart of government, and a UK Strategic
Command contract with tech firm Improbable to build a “digital twin,” a simulation
of the whole of Britain.13 The fact that both ends of the political spectrum
envision data-driven efficiency suggests a big data rerun of the 1930s socialist calculation
debate.14
The thing that is seen in seeing rooms of these kinds, physical rooms with displays
of information to inform decision-makers, is ordered data. There is a kind
of commodity fetishism regarding the mechanics of displaying the data. The tech- 
nology of data has long been glamorous, arousing intense public and political in- 
terest. The great exhibitions and world’s fairs of the nineteenth and early twenti- 
eth centuries had popular displays of high-tech data management artifacts such as 
filing cabinets and cards.15 The same is true of digital technology and Silicon Valley,
which have inspired numerous nonfictional and fictional accounts. Databases
have changed form over time as physical hardware and computational power have 
evolved, so the embodiment and the usability (searchability) of data have not 
been constant, and the technologies of display combined with the classification 
and conceptual framework organizing the data affect the way decision-makers 
understand the world. The emphasis on the synoptic view — through a computer 
simulation, through a room kitted out with the latest screens and data feeds — is 
an assertion of political control through greater legibility. Then-UK government
adviser Dominic Cummings presented it as a matter of public interest: 


There is very powerful feedback between: a) creating dynamic tools to see complex
systems deeper (to see inside, see across time, and see across possibilities), thus making
it easier to work with reliable knowledge and interactive quantitative models,
semi-automating error-correction etc, and b) the potential for big improvements in
the performance of political and government decision-making.16


In other words, the claim is that data science and AI, suitably embodied in a seeing 
room, can be the vehicle for delivering “high performance” by government. 

However, the emphasis is on the technologies of cognition and management, 
rather than the construction of the data going into the process, or the assessment 
of what constitutes improvement. The implicit assumption is that this is a determi- 
nation made by the center, by those in the seeing room. This assumption is exactly 
why an ambition to use data for progress can embed biases, create ambiguity about 
accountabilities, or appear to be part of the surveillance society.17 There is certainly
nothing new about state attempts to exercise comprehensive surveillance. East
Germany’s Stasi offers an extreme recent example. Its data took analog form with a
technological infrastructure turning data into seeing: card records with a bespoke 
filing cabinet technology, photographs, steam irons for opening mail, tape recordings,
and computers. Despite the existence of formal regulations controlling access
to this data, a citizen of the former German Democratic Republic was a gläserne
Mensch, a transparent being. Perhaps we are all becoming transparent now. Digital 
technology makes the amassing of data records trivially cheap and easy by compar- 
ison with the 1980s, and security agencies have been doing this at scale. 

Big tech companies, not just security agencies, have been amassing the biggest 
and best databases and the know-how to use them for a purpose. Their purpose is 
profit, rather than public good, and their market power ensures they do not need
to serve the interests of their users or the public in general. Big Tech’s success vis-à-vis
state power is amply evident in the erosion of national tax bases as ever more
economic activity goes online. It is not clear how much governments can limit 
this.18 As being able to raise tax revenues is a core state function, there can be lit- 
tle question about the power of the biggest digital companies. If the synoptic view 
of what is happening anywhere is available to anybody, it is Google or Facebook. 
They, not officials or politicians, are collecting, categorizing, and using the new 
proliferation of data. 

As long as data are seen as individual property amenable to normal market ex- 
change, that will continue to be the case, despite recent regulatory moves in sev- 
eral jurisdictions to enforce some data sharing by the tech giants. The reason big 
tech companies have been able to acquire their power is the prevailing concep- 
tual framework, crystallized into law, for understanding data as property. Rather 
than the appreciation that data reflect constructed categories, a particular lens or 
framework measuring and shaping reality, data are seen as the collection of nat- 
ural objects: the classifications codified and programmed into data feeds just are 
what they are. These constructed data records are then subject to legal rules of 
ownership. Data are presumed to be transferred and owned by corporations as 
soon as the user of a service has accepted its terms and conditions. 

The consequences of this property rights concept applied to data, or informa- 
tion, illuminate why it is so pernicious. For example, John Deere and General Mo- 
tors (as corporate persons) have claimed in U.S. copyright courts that farmers or 
drivers who thought they were purchasing their vehicles do not in fact own them 
and have no right to repair them. The company’s reasoning is that a tractor is no 
longer mainly a metal object whose ownership as a piece of property is transferred 
from John Deere to the farmer, but rather an intangible data-fed software service 
licensed from the company, which just happens to have a tractor attached.19 Indeed,
screens with data about weather, soil conditions, and seed flow proliferate inside 
tractor cabins and feed into the diagnostic software installed by the manufacturer, 
which provides information to enable decisions raising crop yields. The John Deere 
claim to ownership of the intangible dominates the farmer’s claim to ownership of 
the physical vehicle it is bundled with. To date, the courts have been largely sympa- 
thetic to the corporations and to the strong ownership claims made by Amazon over 
e-books, by makers of games on consoles, as well as by vehicle manufacturers. 

One response to such corporate claims of ownership over data and data processing
has been the demand for corporations to pay for “data as labor.”20 With this, each
data point an online business collects from users’ activities would be rewarded 
with a small financial payment. However, as economist Zoë Hitzig and colleagues
point out, this remedy also considers data as a transferable, individual item of
property, and implicitly as a natural object “given” by the underlying reality.21
The data-as-property perspective assumes data are an object in the world, with
an independent reality, differing from other givens only in being intangible. Yet 
not only are data nonrival (their use does not deplete them so many people can use 
them), but they are also inherently relational. Data are social. Even when it comes 
to data that are seemingly ultrapersonal (for example, that I passed a particular
facial recognition camera at a given moment), the information content and usefulness
of the data are always relational.22 A facial image needs to be compared
with a police database. Even then its utility for the purpose of detecting suspected 
criminals depends on the quality of the training data used to build the machine 
learning algorithm, including its biases, the product of a long history of unequal 
social relations. The relational character of data means they are both constructed 
by social relations and a collective resource for which market exchange will not be 
the best form of organization.23 Indeed, this is why there are few markets for data;
where data are sold (for example, by credit rating agencies), the market is generally
thin, with no standardized, posted prices. The use value of data (their information
content enabling decisions to be made) is highly heterogeneous.


That markets are a poor organizational model for the optimal societal use
of data is Economics 101. Does that make government the right vehicle
to use big data and AI for the public good? Can and should governments
aim to beat big tech at the seeing game? The promise of automating policy
through seeing rooms and use of AI is greater efficiency and, potentially, better 
outcomes. Yet there is increasing use of algorithmic processes in arenas in which 
decisions could have a large impact on people’s lives, such as criminal justice or 
social security. 

Much of the literature on the informational basis of organizations focuses on 
complexity as the constraint on effective information-processing, given an objective
function.24 Automation is superior in routine contexts: more reliable, more
accurate, faster, and cheaper. What is more, machines deal more effectively with 
data complexity than humans do, given our cognitive limitations. This is a key ad- 
vantage of machine learning systems as the data environment grows more com- 
plex. The system is better able than any human to discern patterns and statisti- 
cal relationships in the data, and indeed the more complex the environment, the 
greater the AI advantage over human-scale methods. However, whenever there is 
uncertainty, the advantage tips back to humans. The more frequently the environ- 
ment changes in unexpected ways, or the more dramatic the scale of change, the 
greater the benefits of applying human judgment. The statistical relationships on 
which automated decision rules are based will break down in such circumstances 
(in economics this is known as the Lucas critique).25 The selection of a machine
or human to make decisions is generally presented as a trade-off. However, it has
long been argued, or hoped, that AI can improve the terms of this trade-off.26
There are several reasons to doubt this. One is the well-known issue of bias
in training data sets, the inevitable product of unfair societies in the way data are 
classified, constructed, collected, and ordered.27 Any existing data set reflects
both the classification framework used and the way that framework has shaped 
the underlying reality over time (that is, André Vanoli’s dialectic referred to ear- 
lier). The data science community has become alert to this challenge and many 
researchers are actively working on overcoming the inevitable problems raised by 
data bias. But bias is not the only issue. 

Another less well-recognized issue (at least in the policy world) is that deci- 
sions based on machine learning need an explicitly coded objective function. Yet 
in many areas of human decision-making — particularly the most sensitive, such 
as justice or welfare — objectives are often left deliberately implicit. Politics in de- 
mocracies requires compromise on high-level issues so that low-level actions can 
be taken. These “incompletely theorized agreements” are not amenable to be- 
ing encoded in machine learning (ML) systems, in which precision about the re- 
ward function is needed even if conflicting objectives are combined with different 
weights.28 The further deployment of ML in applied policy practice may require
more explicit statements of objectives or trade-offs, which will be challenging in
any domain where people’s views diverge.29 There could be very many of these,
even in policy areas that seem straightforward. For example, how should public 
housing be allocated? There has been a pendulum swinging over time between 
allocation based on need and allocation based on likelihood to pay rent. These are 
conflicting objectives, and yet many of the same families would be housed under 
either criterion. 
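
The point about explicit objectives can be made concrete with a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any real allocation system: the `Household` records, their scores, and the `allocate` function are all hypothetical. The sketch ranks households by a weighted sum of need and likelihood of paying rent, so the weight that politics can leave implicit has to be written down as a number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Household:
    name: str
    need: float            # hypothetical need score in [0, 1]
    pay_likelihood: float  # hypothetical chance of paying rent, in [0, 1]


def allocate(households, weight_on_need, units):
    """House the top `units` households under an explicit weighted objective."""
    def score(h):
        # The trade-off between the two objectives is a single coded number.
        return weight_on_need * h.need + (1.0 - weight_on_need) * h.pay_likelihood

    ranked = sorted(households, key=score, reverse=True)
    return {h.name for h in ranked[:units]}


# Made-up data for illustration only.
HOUSEHOLDS = [
    Household("A", need=0.9, pay_likelihood=0.8),
    Household("B", need=0.8, pay_likelihood=0.9),
    Household("C", need=0.7, pay_likelihood=0.2),
    Household("D", need=0.3, pay_likelihood=0.7),
    Household("E", need=0.2, pay_likelihood=0.1),
]

by_need = allocate(HOUSEHOLDS, weight_on_need=1.0, units=3)
by_rent = allocate(HOUSEHOLDS, weight_on_need=0.0, units=3)

print("housed on need:", sorted(by_need))
print("housed on ability to pay:", sorted(by_rent))
print("housed either way:", sorted(by_need & by_rent))
```

With these made-up numbers, households A and B are housed under either criterion, and the weight decides only between C and D. That marginal choice is exactly what a coded objective forces into the open.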

The extensive discussions of value alignment in the AI literature tussle with 
how to combine the brutally consequentialist nature of AI with ambiguity or con- 
flicts about values. Given any objective or reward function, ML systems will game 
their targets far more effectively than any bureaucrat ever did. All the critiques of 
target setting in the public management literature, on the basis that officials game 
these for their personal objectives, apply with extra force to systems automating 
target delivery. This has led to concerns, albeit overstated, about runaway outcomes
far from what the human operators of the system wanted.30 One possible
avenue is inverse reinforcement learning (that is, when ML systems try to infer
what they should optimize for), which can accommodate uncertainty about the
objective, but takes the existing environment as the desired state of affairs.31 Political
theorist and ethicist Iason Gabriel rightly emphasizes the need for legitimate
societal processes to enable value alignment; but we do not have these yet.32


Market arrangements based on the concept of private property transactions
are inappropriate for data, given their relational characteristics.
In economic terms, there are large externalities, whereby one individual’s
provision of data can have either negative (loss of privacy) or positive (useful
information) implications for other people. Rather than being considered as
property amenable to market exchange, data instead need to be subject to gover- 
nance arrangements of permitted access and use. Online, the offline norms of so- 
ciologist Georg Simmel’s concept of “privacy in public” do not exist.34 This con- 
cept refers to the norms people adopt limiting what they know about each other 
in their different roles. Even publicly available information (such as where some- 
body lives) is not made known in a specific context (such as the marking of an 
exam paper by their lecturer). These voluntary informational restraints and social 
relations of trust play an important role in sustaining desirable outcomes such as 
fairness, privacy, or self-esteem.35 Similar norms do not exist online. Big tech joins
up too many data about each of us. People can reasonably be concerned about 
government seeing rooms doing the same. 

At the same time, some joining up of data for some uses could without question
lead to improved outcomes for individuals. So we have ended up in the worst of 
all worlds: a “surveillance state” or “surveillance economy” in which valid priva- 
cy concerns about certain data uses prevent other uses of “personal” data for col- 
lective and individual good. Consider the successful argument that governments 
should not use data from COVID-19 apps to trace individuals’ contacts during the 
pandemic, leading almost all governments to adopt the Google and Apple application
programming interfaces (APIs) with privacy enforced, even as personal
liberty was infringed through lockdowns tougher than would have been
needed with effective contact tracing. Meanwhile, governments and researchers 
have been able to use big data and machine learning to inform policies during the 
pandemic but could have done much more to avert unequal health outcomes with 
linked data about individuals’ health status, location, employment, ethnicity, and 
housing. 

The debate about privacy has become overly focused on individual consent 
and data protection. It should be a debate about social norms and what is acceptable
in different contexts, translated into rights of access and use for limited, specific
purposes.36 In both the commercial and the public sphere, the promise of AI
for decision-making will not be realized unless the kind of information norms 
that operate offline are created online. The control of access and use is not just a 
technical issue but a social and political one. 


As the world gets both more complex and more uncertain, big data and AI
will need to socialize in another way, by combining with human judg- 
ment more often. The experiences of 2020, or the impact of extreme 
climate-related events from California burning to Texas freezing, are suggestive 
of the prospect that “radical uncertainty” will characterize the twenty-first century.37
Anybody with any knowledge of forecasting (no matter how small or big the
data set) will know that uncertainty about future outcomes multiplies over time.
“Further computational power doesn’t help you in this instance, because uncer- 
tainty dominates. Reducing model uncertainty requires exponentially greater 
computation.”38 
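
One textbook way to see forecast uncertainty widening with the horizon is a random walk, in which each period adds an independent shock. The sketch below assumes that model purely for illustration; the essay does not specify one, and the function names and the parameter `sigma` (the standard deviation of a single period's shock) are hypothetical. Under that assumption the forecast interval widens with the square root of the horizon, however much data went into estimating `sigma`.

```python
import math


def forecast_interval(last_value, sigma, horizon, z=1.96):
    """Approximate 95% interval for a random walk `horizon` periods ahead.

    Assumes independent, normally distributed shocks with standard
    deviation `sigma`, so forecast variance is sigma**2 * horizon.
    """
    half_width = z * sigma * math.sqrt(horizon)
    return last_value - half_width, last_value + half_width


def interval_width(sigma, horizon, z=1.96):
    """Width of the forecast interval; it does not depend on the starting value."""
    low, high = forecast_interval(0.0, sigma, horizon, z)
    return high - low


for h in (1, 4, 16):
    print(h, round(interval_width(sigma=1.0, horizon=h), 2))
```

In this simple setting, quadrupling the horizon doubles the width of the interval. More data or computation can sharpen the estimate of `sigma`, but it cannot remove the shocks that have yet to happen.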

As radical uncertainty increases, the digital transformation is meanwhile 
expanding the domain of human judgment and trust. Institutional econom- 
ics has generally considered two modes of organization: the market, in which 
price allocates resources, and the hierarchy, in which authority and contract 
apply. But neither price nor authority functions well as an allocation mechanism
when knowledge-based assets are important in production.39 That the market 
is a poor vehicle for the use and provision of public goods such as knowledge 
is a standard piece of economic theory. Similarly, a large body of management 
literature notes that knowledge is hoarded at the top of hierarchical organiza- 
tions, which are consequently good at routine activities but not at adaptation or 
innovation. 

Trust is a more effective mechanism than either market exchange or com- 
mand-and-control for coordinating knowledge-intensive activities, both within 
organizations and between them. The economics literature has long recognized 
the challenge of asymmetric information and tacit knowledge.40 In the digital
knowledge economy, tacit or hard-to-codify knowledge is increasingly impor- 
tant. For example, the advantage of high productivity firms over others is encap- 
sulated in the concept of their “organizational capital.” It reflects their ability to 
manage a complex and uncertain environment, make use of data and software, 
and employ skilled people who have the authority to make decisions. The gap between
firms with high organizational capital and others is growing.41 Trust networks
or communities need to join market and hierarchy as a standard organizational
form. Trust is also essential when questions of accountability are blurred,
as is the case with hard-to-audit automated-decision systems; the alternative is 
costly insurance and/or litigation to assign responsibility for outcomes. 

The desire for the seeing room view rests on an assumption about the possi- 
bility of classifying the world and ordering data as statistical inputs for that syn- 
optic view. Big data does not help overcome the limitations of having to impose 
a classification: AI techniques involve the aggregation of the vast quantities of 
raw, irregular, often by-product data into lower dimensional constructs. The ma- 
chine is doing the classification in ways not legible to humans, but it is doing so 
nonetheless. But there is much useful knowledge that is tacit rather than explicit 
and therefore impossible to classify. There is much that is highly locally heterogeneous,
such that population averages mislead. Nor does having big data and AI overcome
the inevitable clash of values or interests that arises in any specific decision-making
context. Algorithms cannot adjudicate trade-offs and conflicts; only humans
can do so with any legitimacy.
We should think of machines and humans as complements. As societal complexity
and uncertainty increase, and as the zone of automated decisions expands,
this requires more use of human judgment, not less. Otherwise, we will end up 
with Scott’s disasters of modernism, fully automated. Practical, tacit, improvisa- 
tional knowledge and informal decision-making processes are always essential for 
actions to deliver better outcomes locally: even setting aside the point that peo- 
ple might have different and irreconcilable views about what constitutes “better,” 
there are limits to classifiable knowledge, and limits to data. 


The use of AI in society must reflect the social nature of data. Although big
data offers great potential for progress, any data set is a limited, encoded
representation of reality, embedding biases and assumptions, and ignoring
information that cannot be codified. A synoptic view of society from a
data-enabled seeing room is impossible because no authority can stand outside 
the reality their decisions will in fact shape. For the promise of AI to be realized, 
three things are needed: new norms (as well as laws and technologies) governing 
access and use of data, embedding offline limits online; effective organizations 
empowering human judgment alongside automated decisions; and legitimate 
processes to shape the collective decisions being coded into AI. Adopting AI first 
and reflecting on these needs later is the wrong way to go about socializing data. 


AUTHOR’S NOTE 


My thanks to the following colleagues for their helpful comments on an early draft: 
Vasco Carvalho, Verity Harding, Bill Janeway, Michael Kenny, Neil Lawrence, and 